ABSTRACT

This paper describes Carnegie Mellon University’s entry at the TREC
2014 Federated Web Search track (FedWeb14). Federated search
pipelines typically have two components: (i) resource-selection, and
(ii) result-merging. This work documents experiments that modify
queries in order to merge results in the federated-search pipeline.
Previous approaches to this problem relied on custom query-document
similarity scores or rank-combination methods. In this document, we
explore how term-dependence models and query expansion strategies
influence result-merging.
1.  INTRODUCTION

Federated search deals with the problem of aggregating results from
multiple search engines. The individual search engines (i) are
typically focused on a particular domain or a particular corpus,
(ii) employ diverse retrieval models, and (iii) do not necessarily
expose the statistics used in information retrieval algorithms.

The problem of federated search thus involves (i) analyzing a query
to determine which search engines are appropriate for addressing the
information need (resource selection), and (ii) merging the results
returned by each of these engines (result merging).

The TREC Federated Web Search Track is a setting for evaluating
approaches to federated search. The FedWeb14 track contained three
components: (i) vertical selection, (ii) resource selection, and
(iii) results merging. Vertical selection involves predicting the
quality of verticals (like sports and news) for a query. Resource
selection involves ranking the available search engines given a
particular query. Result merging involves mixing results from a few
chosen resources for a given query. A typical system leverages
vertical selection and resource selection to produce a ranked list
of resources (search engines). The query is then issued to a few
highly-ranked resources, and the documents returned by these
resources are merged in the result-merging phase.

In  this  work,  we  focus  on  the  result-merging  phase  of 
federated-search  systems.  In  particular,  we  explore  some 
techniques  that  either  modify  or  expand  the  query-terms 
to  improve  performance  on  result  merging  tasks.  Among 
the  approaches  implemented,  we  leverage  term-dependence 
models  and  neural  network  word  embeddings. 

In  the  following  sections,  we  describe  existing  approaches, 
the  methods  implemented  in  this  work  and  an  evaluation  of 
the  methods. 

2.  RELATED  WORK 

Federated search is a well-explored problem in information retrieval
research. The subproblems of resource-selection and result-merging
have been well studied in the past. Shokouhi & Si [16] presented a
comprehensive survey of techniques in federated search. Si & Callan
[18] presented a semi-supervised approach to result merging that
used the documents acquired by query-based sampling as training data
and linear regression to learn the resource- and query-specific
merging models. Shokouhi & Zobel [17] presented a technique for
using documents sampled from a resource to estimate the global
scores of documents for a query.

A set of more than 150 real-world search engines, along with
query-based samples from each, was provided in TREC FedWeb13 (2013).
Several approaches were employed for merging results. For instance,
Mourao et al. [13] presented approaches that combined several
rank-combination techniques. Di Buccio et al. [6] presented a
round-robin approach for merging results. At TREC 2013, Guang et al.
[7], Bellogin et al. [1] and Pal et al. [14] showed approaches where
global document scores for ranking were produced using a combination
of query-document similarity measures that at times included scores
assigned to resources. However, several of the successful methods
assumed that (i) the entire set of documents retrieved from the
selected resources was available during the result-merging phase,
and (ii) documents retrieved by the resources were available for
indexing and searching. In a typical setting, these assumptions are
not necessarily valid.

Word embeddings are mappings from a word to a vector, which
typically belongs to a continuous vector space. These embeddings
allow us to reason about the syntactic and semantic properties of
words through linear algebra operations like similarity and distance
functions. Word embeddings have been extensively studied in the
past. One of the earliest works in this area was the LSI algorithm
by Deerwester et al. [4]. The LSI algorithm constructed word
embeddings from a term × document matrix using a dimension-reduction
operation. Recent approaches for constructing these embeddings have
leveraged neural networks extensively. Bengio et al. [2, 3]
demonstrated the benefits of using these embeddings in language
modeling. Embeddings produced by neural networks have provided
significant gains over the state-of-the-art approaches in several
natural language processing (NLP) applications such as sentiment
classification and word clustering. Mikolov et al. [11, 12] produced
word embeddings that have been used in other NLP tasks. Our paper
leverages the embeddings from Mikolov et al. [12] to augment queries
with additional terms and to weight query terms.

Metzler & Croft [9] modeled the dependencies between the terms in a
query and demonstrated that the resulting queries performed better
than queries that used the individual query terms alone. In this
paper, we use the sequential dependence model (SDM) for modeling
term dependencies.

Term weighting approaches have been used extensively in information
retrieval systems. Robertson & Zaragoza [15] provide a survey of
probabilistic models, a few of which contain term-weighting schemes
(like the BM25 model). In this paper, we use the word embeddings
from [12] for weighting terms.

3.  APPROACH 

In a federated search pipeline, the result-merging task follows a
resource-selection step. The resource-selection step returns a
ranked list of resources. The query (or a transformation of it) is
issued to a few of the top-ranked resources, and the results from
all these resources are combined in the result-merging phase.

In FedWeb14, only the snippets returned by each of the resources
were provided. Thus, the following approaches operate on an index of
all the snippets returned by the top resources for a query. In the
following subsections we describe the various methods implemented.

3.1  Unstructured  Queries 

Each of the documents (whose snippets we have access to) is ranked
using a classic retrieval model: language modeling with Dirichlet
smoothing [19].
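
As an illustration of the retrieval model named above, the following is a minimal sketch of query-likelihood scoring with Dirichlet smoothing. It is not the Indri implementation used in our runs: the function name, the toy snippets and the choice to skip query terms that never occur in the collection are all assumptions made for the example. The smoothing parameter defaults to 2500, the value reported in Section 4.2.

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len, mu=2500.0):
    """Log query likelihood of one document under Dirichlet smoothing."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            # Assumption: terms unseen in the whole collection are skipped.
            continue
        score += math.log((doc_tf[term] + mu * p_coll) / (doc_len + mu))
    return score

# Hypothetical toy snippets standing in for the per-query snippet index.
snippets = {
    "s1": "burning man tickets on sale now".split(),
    "s2": "concert tickets for sale".split(),
    "s3": "how to fix a burning smell".split(),
}
collection_tf = Counter(t for terms in snippets.values() for t in terms)
collection_len = sum(collection_tf.values())

query = "burning man tickets".split()
ranking = sorted(
    snippets,
    key=lambda s: dirichlet_score(query, snippets[s], collection_tf, collection_len),
    reverse=True,
)
print(ranking)
```

With snippets this short and a prior of 2500, the smoothing term dominates every score, so the differences between snippets are small; the ordering is still driven by term matches.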

3.2  Sequential  Dependence  Model  Queries 

Sequential-dependence models assume a dependence between
neighboring query terms. Essentially, the similarity between a query
and a document is measured as a weighted combination of (i) a
unigram score (each term individually), (ii) an exact-match bigram
score, and (iii) an unordered-window bigram score. Table 1 shows an
example of a query and its sequential dependence variant.

In this approach, for each query, the new rankings of all the
snippets are given by the scores obtained from executing the
sequential-dependence query on the index. The retrieval model
employed is the standard Indri retrieval model [19].

Query:
    burning man tickets

SDM Query Indri Expression:
    #weight(
        λ1 #combine(burning man tickets)
        λ2 #combine(#1(burning man) #1(man tickets))
        λ3 #combine(#uw8(burning man) #uw8(man tickets)))

Table 1: An example of a query and the associated Indri expression
for the sequential dependence model (SDM) query.
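
The construction illustrated in Table 1 can be sketched as a small string-building function. This is a minimal sketch, not the code used for our runs: the function name `sdm_query` is hypothetical, the default weights are the values reported in Section 4.2, and the window size of 8 is taken from Table 1. Falling back to a plain `#combine` for single-term queries is an assumption.

```python
def sdm_query(query, weights=(0.5, 0.25, 0.25), window=8):
    """Build an Indri sequential-dependence expression for a keyword query."""
    terms = query.split()
    unigrams = "#combine({})".format(" ".join(terms))
    if len(terms) < 2:
        # Assumption: with no adjacent pairs, only the unigram part is used.
        return unigrams
    pairs = list(zip(terms, terms[1:]))
    # Exact-match bigrams: #1(a b) requires a immediately followed by b.
    ordered = "#combine({})".format(
        " ".join("#1({} {})".format(a, b) for a, b in pairs))
    # Unordered windows: #uwN(a b) requires both terms within N positions.
    unordered = "#combine({})".format(
        " ".join("#uw{}({} {})".format(window, a, b) for a, b in pairs))
    w_uni, w_ord, w_unord = weights
    return "#weight({} {} {} {} {} {})".format(
        w_uni, unigrams, w_ord, ordered, w_unord, unordered)

print(sdm_query("burning man tickets"))
```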

3.3  Expanding  Using  Word  Embeddings 

Word embeddings are a mapping from words to a vector space. These
embeddings often capture and/or preserve linguistic properties of
words. This allows scores or probabilities that are computed for a
term to be applied to a semantically similar term. A brief
introduction to continuous vector representations is provided below,
and the query-expansion strategies used are described after it.

Bengio et al. [2] proposed the use of continuous representations of
words for language modeling. The intuition behind this approach was
that these embeddings could capture semantic similarities between
words and thus help overcome data sparsity issues in language
modeling tasks. For example, if the sentence "the cat is walking in
the room" is observed in the training corpus, then the evidence
gathered from this example should generalize to a sentence like "the
dog is walking in the house". Data-sparsity issues can otherwise
lead to the latter sentence having zero evidence.

Generating a continuous vector representation for each word allows
us to transfer evidence from the term cat to the term dog and from
room to house. The representations for (semantically) similar terms
are thus expected to be similar. Several approaches have been
studied to construct such representations. Neural network based
language models aim to learn these representations and a statistical
language model for the underlying text. These models mainly belong
to the two categories described below.

•  Models  that  learn  the  word  representations  and  the 
language  model  jointly.  The  language  model  described 
in  [2]  falls  in  this  category. 

•  Models that learn the word vector representations first
and then train the language model with the word vectors.
These models are computationally easier to construct.

The Continuous Bag-of-Words model and the Continuous Skip-gram model
proposed by Mikolov et al. in [10] belong to the latter category.
Word vector representations on a Google News corpus with 100 billion
words for a vocabulary of 3 million words can be learned in less
than one day using modest hardware. Word vectors learned by both
models have performed well in several semantics-related task
evaluations, as shown in [10]. An example is shown in Table 2. In
this example, the top 5 words closest to the word france are
displayed. It is clear that the retrieved words are semantically
similar (at least for this example).


Word            Cosine Similarity
spain           0.678515
belgium         0.665923
netherlands     0.652428
italy           0.633130
switzerland     0.622323

Table 2: Five words most similar to the word france.
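
A lookup like the one in Table 2 amounts to ranking the vocabulary by cosine similarity to the target word's vector. The sketch below shows this on a tiny hand-made vocabulary; the three-dimensional vectors are invented for illustration and are not the learned embeddings, so the similarity values differ from those in Table 2.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_words(target, embeddings, k=5):
    """Return the k words most similar to `target`, with their similarities."""
    target_vec = embeddings[target]
    scored = [(w, cosine(target_vec, v))
              for w, v in embeddings.items() if w != target]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings, not the Google News vectors.
toy = {
    "france": [0.9, 0.1, 0.0],
    "spain": [0.8, 0.2, 0.1],
    "italy": [0.7, 0.3, 0.0],
    "tickets": [0.0, 0.1, 0.9],
}
neighbours = nearest_words("france", toy, k=2)
for word, sim in neighbours:
    print(word, round(sim, 3))
```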

3.4  Query  Expansion  Strategies 

We used two approaches to augment a query with additional terms.
These approaches find additional terms that are either (i) similar
to the query as a whole, or (ii) similar to individual terms in a
query. In both approaches, the terms retrieved for a query are
similar to a vector (which represents either a single term or an
aggregate of the query). To compute a vector that represents the
entire query, we obtain the vectors for each of the terms and
compute the mean vector.

Thus, in the first expansion strategy, we add to the query a few
terms that are closest to the query mean vector. In the second
strategy, for each query term, we retrieve additional words closest
to that term's vector. In both cases, the terms added are obtained
from the global vocabulary of the available word embeddings.

Once the additional terms are added to the query, the snippets are
scored against this expanded query using the language model with
Dirichlet smoothing.
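
The two expansion strategies can be sketched as follows. This is a minimal illustration on invented toy vectors, not our actual code. The defaults of at most 5 added terms with cosine similarity above 0.7 are the values reported in Section 4.2; applying that cap per query term in the second strategy is an assumption, as are all function names. Out-of-vocabulary query terms are simply skipped.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def closest_terms(vec, embeddings, exclude, max_terms=5, min_sim=0.7):
    """Vocabulary words most similar to `vec`, above the similarity threshold."""
    scored = [(cosine(vec, v), w) for w, v in embeddings.items() if w not in exclude]
    scored = [(s, w) for s, w in scored if s > min_sim]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:max_terms]]

def expand_by_query_mean(terms, embeddings, **kwargs):
    """Strategy 1: add words closest to the mean of the query term vectors."""
    known = [embeddings[t] for t in terms if t in embeddings]
    if not known:
        return list(terms)
    return list(terms) + closest_terms(mean_vector(known), embeddings, set(terms), **kwargs)

def expand_by_term(terms, embeddings, **kwargs):
    """Strategy 2: add words closest to each individual query term."""
    expanded = list(terms)
    for t in terms:
        if t in embeddings:
            # Assumption: the cap on added terms applies per query term.
            expanded += closest_terms(embeddings[t], embeddings, set(expanded), **kwargs)
    return expanded

# Hypothetical toy embeddings for illustration only.
toy = {
    "burning": [1.0, 0.0, 0.0], "man": [0.0, 1.0, 0.0],
    "fire": [0.9, 0.1, 0.0], "person": [0.1, 0.9, 0.0],
    "festival": [0.5, 0.5, 0.5], "tickets": [0.0, 0.0, 1.0],
}
by_mean = expand_by_query_mean(["burning", "man"], toy)
by_term = expand_by_term(["burning", "man"], toy)
print(by_mean)
print(by_term)
```

In this toy vocabulary, festival is close to the query as a whole but not to either term individually, so only the first strategy adds it.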

3.5  Term  Weighting  Strategies 

Our approach uses word embeddings from [10] to produce weights for
individual query terms. We use two strategies to weight terms. In
both cases, the weight applied to each term is the distance between
the term’s embedding and a certain global vector. The distance
metric is Euclidean distance in both approaches. The intuition
behind using distance from a vector is that the farther a term is
from the global vector, the more information it contains, and thus
it merits a higher weight.

In the first of these approaches, the global vector used is the
query mean vector obtained by averaging the vectors corresponding to
the terms in the query. In the second approach, the global vector
used is the average vector of the entire vocabulary of the learned
embeddings (close to 3 million words and phrases).
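
Both weighting strategies reduce to measuring the Euclidean distance between each query term's vector and a reference vector. The sketch below illustrates this on invented two-dimensional vectors. The function names and the `reference` switch are hypothetical; the default weight of 1.0 for out-of-vocabulary terms is the value reported in Section 4.2. How the weights are then attached to the query is not shown.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def term_weights(terms, embeddings, reference="query", oov_weight=1.0):
    """Weight each term by its distance from a global reference vector.

    reference="query":      mean of the in-vocabulary query term vectors.
    reference="vocabulary": mean of every vector in the embedding table.
    """
    if reference == "query":
        pool = [embeddings[t] for t in terms if t in embeddings]
    elif reference == "vocabulary":
        pool = list(embeddings.values())
    else:
        raise ValueError("reference must be 'query' or 'vocabulary'")
    if not pool:
        return {t: oov_weight for t in terms}
    center = mean_vector(pool)
    return {t: euclidean(embeddings[t], center) if t in embeddings else oov_weight
            for t in terms}

# Hypothetical toy embeddings for illustration only.
toy = {
    "burning": [1.0, 0.0], "man": [0.0, 1.0],
    "tickets": [4.0, 4.0], "the": [0.5, 0.5],
}
query = ["burning", "man", "tickets", "zzz"]
query_weights = term_weights(query, toy, reference="query")
vocab_weights = term_weights(query, toy, reference="vocabulary")
print(query_weights)
print(vocab_weights)
```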

4.  EVALUATION 

In this section, we elaborate on the FedWeb13 and FedWeb14
collections and present the results.

4.1  Data  collections 

FedWeb13 contains results sampled from 157 real-world search engines
in 24 verticals. 2000 queries were issued to the search engines
during the sampling phase. See Table 3 for a summary of the data
statistics of FedWeb13. In this paper, we report experiments and
analysis on the FedWeb13 dataset since, as of this writing, the
tools for evaluating on FedWeb14 have not yet been released.

The data collection of FedWeb14 is built from 149 web search engines
crawled between April and May 2014. 4000 queries were issued to the
search engines in the sampling phase. Table 4 presents the data
statistics of FedWeb14. For the result merging phase, only snippets
were provided.


              Samples (2000 Queries)       Topics
              Snippets      Pages          Snippets    Pages
Total         1,973,591     1,894,463      143,298     136,103
Per Engine    12,570.6      12,066.6       912.7       866.9

Table 3: FedWeb13 collection statistics.

              Samples (4000 Queries)       Topics
              Snippets      Pages          Snippets    Pages
Total         1,422,758     3,471,773      51458       0
Per Engine    9548.7        23300.5        345.3       0

Table 4: FedWeb14 collection statistics.


For the FedWeb14 dataset, only the results supplied by the
organizers are reported, since tools for performing a per-query
analysis have not yet been released (as of this writing).

4.2  Experimental  setup 

We use the Indri search engine to index and search the snippets for
each search engine. Stop-word removal and stemming did not aid
system performance significantly. Since the embeddings were built on
a large English corpus, the risk of missing term vectors is minimal.
When an out-of-vocabulary (OOV) term was encountered, we did not
include a vector for it. All our approaches used a classic retrieval
model: the language model with Dirichlet smoothing. The parameter μ
for this retrieval model was set to the Indri default of 2500. For
the sequential-dependence model implementation, the weights assigned
to the unigram, exact-match bigram and window bigram components were
0.5, 0.25 and 0.25 respectively. For the query-expansion strategies,
at most 5 additional terms with cosine similarity scores above 0.7
were chosen for both strategies. For term weighting, OOV terms were
assigned a default weight of 1.0. The word-vector representations
used were 300-dimensional vectors released by Google, trained on a
Google News corpus of about 100 billion words.

We assessed the runs with the gdeval.pl tool provided by TREC and
focus on NDCG@20 for the result merging task. The results for the
FedWeb13 data are in Table 5. Performance of our system alone on
FedWeb14 is provided in Table 6 (since the best runs were not
available at the time of submission). In both tables, plain refers
to the approach in section 3.1, sdm refers to the approach in
section 3.2, and Exp-Avg and Exp-Term refer to the query-expansion
strategies explained above.
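
For readers unfamiliar with the metric, the following is a minimal sketch of NDCG at a rank cutoff, using one common formulation (exponential gain and a log2 rank discount). It is an illustration only: the function names are hypothetical, and the official gdeval.pl script may differ in details such as how graded judgments are mapped to gains.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=20):
    """NDCG@k: DCG of the system ranking divided by the ideal DCG.

    ranked_relevances: labels of the returned results, in ranked order.
    judged_relevances: labels of all judged results for the query.
    """
    ideal = dcg(sorted(judged_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical judgments: one relevant result, ranked second by the system.
print(ndcg_at_k([0, 1], [1, 0]))
```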

All the result-merging scores were based on baseline
resource-selection runs provided by the organizers. In addition to
the best performing system, we include the best performing system
that only used snippets, since the FedWeb13 task allowed
participants to use the documents returned by each of the resources
during the result-merging phase. In FedWeb14, only snippets were
available for use. The results of the plain retrieval model, the SDM
queries and the expansion strategies are also provided for FedWeb14.
The term-weighting approach was not submitted to the FedWeb14 task,
and thus we only provide results on FedWeb13 for this approach.

Table 5 lists the performance of our system and the best FedWeb13
runs. We observe that using only snippets, as opposed to full
documents, leads to a large drop in system performance: the best run
falls from 0.257 to 0.161 and the median from 0.162 to 0.142. This
is consistent with the observations of the FedWeb13 organizers [5].
Thus, for a realistic comparison we only consider the best
submission from FedWeb13 that did not use documents (shown as
FedWeb13-Snippets-Best). Our approaches clearly outperform this best
snippet-only result-merging score from the FedWeb13 track. In
particular, we note that most of the models perform very similarly
to each other, and that there is a minor performance drop when
expanding a query with terms close to the query mean vector.

Approach                      NDCG@20
FedWeb13-Docs-Best            0.257
FedWeb13-Docs-Median          0.162
FedWeb13-Snippets-Best        0.161
FedWeb13-Snippets-Median      0.142
FedWeb13-plain                0.210
FedWeb13-sdm                  0.224
FedWeb13-expansion-1          0.188
FedWeb13-expansion-2          0.201
FedWeb13-weighting-1          0.213
FedWeb13-weighting-2          0.211

Table 5: NDCG@20 of result-merging runs on FedWeb13.

Approach                      NDCG@20
FedWeb14-Best                 0.323
FedWeb14-Median               0.289
FedWeb14-plain                0.277
FedWeb14-sdm                  0.276
FedWeb14-expansion-1          0.285
FedWeb14-expansion-2          0.286

Table 6: NDCG@20 of result-merging runs on FedWeb14.

We also report results for some of our techniques on the FedWeb14
corpus (shown in Table 6). In this corpus, we notice that our
approaches are extremely close to the median score, and the
performance gap between our approach and the best system is slightly
larger than the gap for the FedWeb13 corpus. In this case, the
expansion strategies slightly outperform the other approaches.

On a per-query basis, there is no particular kind of query in the
FedWeb13 corpus that was aided by our approaches. The variance
between the various approaches implemented is not particularly high.

5.  CONCLUSIONS 

In this work, we explored how query transformations can be leveraged
for merging results in the federated search pipeline. The first
observation from the FedWeb13 collection is that when restricted to
using snippets, the performance drops quite severely, an observation
also made by the organizers in [5]. The best performance on the
FedWeb13 dataset was obtained by employing sequential-dependence
models. The query-expansion approaches did not provide a performance
improvement compared to sequential-dependence models and classic
retrieval models like the language model with Dirichlet smoothing.
On FedWeb14, however, query expansion using word vectors provided a
slight improvement in performance. Newer advances in learning
continuous representations of paragraphs or documents (as
demonstrated by [8]) can be leveraged in the future to provide a
more principled approach to query expansion and document (snippet)
representation.